This script is a supplement to Thielking (2019) and extends that work on /s/-retraction in Glasgow by including dynamic articulatory measurements of the lips, from lip profile video, and of the tongue, from ultrasound imaging.
Dynamic measures of lip protrusion were generated using the following procedure. While in Thielking (2019) lip protrusion was based on manual annotation of the lips at a single point, this work, inspired by King and Ferragne (2020a), uses deep learning to automatically segment the lips from the relevant lip video frames across the entire duration of the sibilants.
King and Ferragne (2020a) showed that the lips in /r/ and /w/ can be successfully segmented automatically with the help of Convolutional Neural Networks (CNNs) and transfer learning from the field of image recognition. Similarly, the present work aimed to implement a deep learning approach on grayscale profile-view images of the lips for the sibilants /s/ and /∫/ in various contexts. Instead of training a CNN from scratch, which requires large amounts of training data, transfer learning makes use of pre-trained models from a source domain by adapting these models to a new use case. In the current study, transfer learning was implemented with the help of the PixelLib Python package. The Mask R-CNN model was trained for 50 epochs with a batch size of 4 using the ResNet-101 backbone on Google Colab. In the present work, inference on an image of the lips yielded an output image with the segmented lips and a corresponding binary image of the segmentation mask.
The training and validation data sets consist of 200 manually annotated images of the lip area. To improve segmentation accuracy, the images in this dataset were drawn from all speakers, from different points in the sibilant duration, and from various sibilant contexts. In contrast to King and Ferragne (2020a), segmentation was not limited to representative midpoint frames of the sibilants; instead, automatic segmentation was extended to all frames across the sibilant duration, resulting in the segmentation of ~14000 images.
After training, the model achieved a mean Average Precision (mAP) of 0.89.
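Segmentation mAP scores of this kind are built on the Intersection-over-Union (IoU) between predicted and ground-truth masks. The following minimal sketch, in plain NumPy rather than PixelLib's actual evaluation code, illustrates the underlying quantity; `mask_iou` and `precision_at` are illustrative helpers, not part of the pipeline:

```python
import numpy as np

def mask_iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection-over-Union of two binary segmentation masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    return np.logical_and(pred, truth).sum() / union if union else 0.0

def precision_at(ious, threshold=0.5):
    """Fraction of predicted masks counted as correct at an IoU threshold."""
    return float((np.asarray(ious) >= threshold).mean())

# toy example: two 4x4 masks of 4 pixels each, sharing exactly 1 pixel
a = np.zeros((4, 4), int); a[1:3, 1:3] = 1
b = np.zeros((4, 4), int); b[2:4, 2:4] = 1
print(mask_iou(a, b))  # 1 shared pixel / 7 pixels in the union ≈ 0.143
```

Averaging such precision values over IoU thresholds and classes gives a mean Average Precision; the exact averaging scheme depends on the evaluation protocol used.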
Workflow
After all ~14000 lip video frames were automatically segmented, the segmentation masks were extracted. In addition to the automatic segmentation of the lips, lip protrusion was also measured automatically with the help of the image processing library OpenCV. Lip protrusion was measured following the procedure laid out in Lawson, Stuart-Smith, and Rodger (2019) and King and Ferragne (2020b), with only minor changes. First, a horizontal fiducial line was placed so that it intersected the participant’s corner of the mouth (see Figure 2.1). A second, vertical fiducial line was positioned touching the edges of both the upper and lower lips. Lip length was then measured in pixels along the horizontal fiducial, up to its intersection with the vertical fiducial. While the horizontal fiducial line was kept constant in all recordings, the vertical fiducial was adapted to the lip positioning, resulting in an increase or decrease of lip length along the horizontal fiducial. The edges of the upper and lower lips were automatically identified from the segmentation masks using OpenCV’s findContours(), convexityDefects() and convexHull() functions.
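The fiducial-line measurement can be sketched as follows. This is a simplified NumPy reconstruction on a synthetic mask, not the actual OpenCV pipeline: the mouth-corner row is passed in directly rather than detected from the contour, and `lip_length` is a hypothetical helper:

```python
import numpy as np

def lip_length(mask: np.ndarray, corner_row: int) -> int:
    """Lip length in pixels along a horizontal fiducial line.

    mask       : binary segmentation mask of the lips (1 = lip pixel)
    corner_row : image row of the horizontal fiducial through the mouth corner
    Returns the distance from the rearmost lip pixel on that row to the
    vertical fiducial touching the frontmost edge of the lips.
    """
    on_row = np.flatnonzero(mask[corner_row])
    if on_row.size == 0:
        raise ValueError("horizontal fiducial does not intersect the mask")
    # vertical fiducial: column of the frontmost lip pixel anywhere in the mask
    front_col = int(np.flatnonzero(mask.any(axis=0)).max())
    return front_col - int(on_row.min())

# synthetic mask: lips spanning columns 10..24, with the lower lip
# protruding to column 30 below the mouth-corner row
mask = np.zeros((50, 60), dtype=int)
mask[20:30, 10:25] = 1
mask[25:28, 10:31] = 1
print(lip_length(mask, corner_row=22))  # 30 - 10 = 20 pixels
```

Because the vertical fiducial follows the frontmost lip edge, greater protrusion of either lip increases the measured length, matching the behaviour described above.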
Lip protrusion measurements were z-scored for inter-speaker comparability.
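The per-speaker z-scoring amounts to centring and scaling each speaker's protrusion values by that speaker's own mean and standard deviation. A minimal sketch with made-up numbers; `zscore_by_speaker` is an illustrative helper, not part of the actual pipeline:

```python
import numpy as np

def zscore_by_speaker(values, speakers):
    """Z-score measurements within each speaker for cross-speaker comparability."""
    values = np.asarray(values, dtype=float)
    speakers = np.asarray(speakers)
    out = np.empty_like(values)
    for s in np.unique(speakers):
        sel = speakers == s
        out[sel] = (values[sel] - values[sel].mean()) / values[sel].std()
    return out

# two speakers whose raw pixel measurements are on very different scales
lp = [10, 12, 14, 100, 120, 140]
spk = ["F01", "F01", "F01", "M01", "M01", "M01"]
z = zscore_by_speaker(lp, spk)
print(z)  # each speaker maps to the same z-scores: -1.22, 0, 1.22
```

After z-scoring, both speakers' values lie on a common scale, so between-speaker differences in camera distance or lip size no longer dominate the comparison.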
Figure 2.1: Automatic segmentation of the mouth (in blue) via semantic segmentation using a CNN (left). Extraction of relevant lip landmarks and plotting of fiducials in segmentation mask (right).
The video above shows the word *street* produced by speaker F01, together with the corresponding automatic segmentation of the lips and the measured lip protrusion by frame. Note: playback speed has been reduced to 50%.
Table 2.1 shows the mean number of video frames captured along the duration of the sibilant for each speaker.
| Speaker | Mean frames per sibilant | SD of frames |
|---|---|---|
| F01 | 5.34 | 1.03 |
| F02 | 6.02 | 1.39 |
| F03 | 4.90 | 1.48 |
| F04 | 4.70 | 1.40 |
| F05 | 3.27 | 0.88 |
| F06 | 6.45 | 1.26 |
| F07 | 6.01 | 1.04 |
| F08 | 3.55 | 1.11 |
| F09 | 3.56 | 1.20 |
| M01 | 3.93 | 0.96 |
| M02 | 6.05 | 0.98 |
| M03 | 4.13 | 0.82 |
| M04 | 5.33 | 1.02 |
| M05 | 4.24 | 1.29 |
| M06 | 4.55 | 1.23 |
| M07 | 4.62 | 1.74 |
To verify the automatically segmented lip measurements, a Pearson correlation test was conducted comparing these measurements to the manually annotated measurements in Thielking (2019), showing \(r = 0.41\), \(p_{\text{one-tailed}} < 0.001\). This relatively low correlation may partly reflect a methodological difference: Thielking (2019) annotated the maximum protrusion in the central portion of the sibilant, while the automatic method measured lip protrusion at the midpoint of the sibilant.
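For reference, a one-tailed Pearson test of this kind can be sketched as follows. The data below are synthetic, and to stay dependency-free the sketch uses a normal approximation to the t distribution for the p-value (adequate at large sample sizes) rather than the exact test used in the analysis:

```python
import math
import numpy as np

def pearson_one_tailed(x, y):
    """Pearson r with a one-tailed p-value for H1: r > 0.

    Uses t = r*sqrt(n-2)/sqrt(1-r^2) and approximates the t distribution
    with a standard normal, which is adequate for large n.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    r = float(np.corrcoef(x, y)[0, 1])
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    p = 0.5 * math.erfc(t / math.sqrt(2))  # upper-tail normal probability
    return r, p

# synthetic stand-ins for the automatic and manual measurements
rng = np.random.default_rng(1)
auto = rng.normal(size=500)
manual = 0.4 * auto + rng.normal(size=500)  # moderately correlated by design
r, p = pearson_one_tailed(auto, manual)
print(f"r = {r:.2f}, one-tailed p = {p:.2g}")
```

A moderate r can thus still be highly significant, as here, because the significance reflects the sample size, not the strength of agreement between the two methods.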
Plotting these midpoint protrusion measures, however, shows a pattern similar to that in Thielking (2019). As can be seen in Figure 2.2, /∫/ displays the largest amount of lip protrusion of the four contexts under investigation. Furthermore, the two clusters /str/ and /stj/ show more protrusion than pre-vocalic /s/.
As expected, taking into account the following vowel (Fig. 2.3) shows that sibilants followed by rounded vowels display more lip protrusion compared to unrounded vowels. This effect seems to be largest for /sV/. Note that /stj/ only occurs in rounded vowel contexts.
Figure 2.2: Z-scored lip protrusion by context across all speakers.
Figure 2.3: Midpoint lip protrusion for various contexts and following vowel
Running a linear mixed effects model reveals statistically significant differences between /sV/ and both /stj/ and /∫/. There is no significant difference between /sV/ and /str/.
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: z_lp ~ context * vowel + (1 | word) + (1 | speaker)
## Data: lips
##
## REML criterion at convergence: 12967.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -9.1188 -0.5209 0.0506 0.5847 15.8657
##
## Random effects:
## Groups Name Variance Std.Dev.
## word (Intercept) 0.01506 0.1227
## speaker (Intercept) 0.01084 0.1041
## Residual 0.70208 0.8379
## Number of obs: 5187, groups: word, 22; speaker, 16
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 0.08154 0.08138 16.93584 1.002 0.330461
## contextstr 0.17691 0.12330 14.51062 1.435 0.172561
## contextstj 0.43704 0.10282 14.31909 4.250 0.000770 ***
## contextsh 0.90921 0.10867 13.67888 8.367 9.54e-07 ***
## vowelunrounded -1.14858 0.10900 13.84618 -10.538 5.40e-08 ***
## contextstr:vowelunrounded 0.83104 0.16046 14.39894 5.179 0.000128 ***
## contextsh:vowelunrounded 0.55639 0.15391 13.76184 3.615 0.002886 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) cntxtstr cntxtstj cntxtsh vwlnrn cntxtst:
## contextstr -0.592
## contextstj -0.710 0.469
## contextsh -0.672 0.443 0.532
## vowelunrndd -0.670 0.442 0.530 0.502
## cntxtstr:vw 0.455 -0.768 -0.360 -0.341 -0.679
## cntxtsh:vwl 0.474 -0.313 -0.375 -0.706 -0.708 0.481
## fit warnings:
## fixed-effect model matrix is rank deficient so dropping 1 column / coefficient
Tukey-corrected pairwise comparisons reveal significant differences between unrounded and rounded following vowels, in that rounded vowels show more protrusion than unrounded vowels in /sV/ and /∫/. However, there is no significant effect for /str/.
## $`emmeans of context, vowel`
## context vowel emmean SE df asymp.LCL asymp.UCL
## sV rounded 0.0815 0.0814 Inf -0.078 0.2410
## str rounded 0.2584 0.0997 Inf 0.063 0.4539
## stj rounded 0.5186 0.0729 Inf 0.376 0.6615
## sh rounded 0.9907 0.0809 Inf 0.832 1.1494
## sV unrounded -1.0670 0.0814 Inf -1.227 -0.9075
## str unrounded -0.0591 0.0727 Inf -0.202 0.0834
## stj unrounded nonEst NA NA NA NA
## sh unrounded 0.3986 0.0814 Inf 0.239 0.5580
##
## Degrees-of-freedom method: asymptotic
## Confidence level used: 0.95
##
## $`pairwise differences of context, vowel`
##  contrast                   estimate     SE  df z.ratio p.value
## sV rounded - str rounded -0.177 0.1233 Inf -1.435 0.8412
## sV rounded - stj rounded -0.437 0.1028 Inf -4.250 0.0006
## sV rounded - sh rounded -0.909 0.1087 Inf -8.367 <.0001
## sV rounded - sV unrounded 1.149 0.1090 Inf 10.538 <.0001
## sV rounded - str unrounded 0.141 0.1027 Inf 1.370 0.8712
## sV rounded - stj unrounded nonEst NA NA NA NA
## sV rounded - sh unrounded -0.317 0.1090 Inf -2.908 0.0709
## str rounded - stj rounded -0.260 0.1179 Inf -2.207 0.3474
## str rounded - sh rounded -0.732 0.1230 Inf -5.953 <.0001
## str rounded - sV unrounded 1.325 0.1233 Inf 10.750 <.0001
## str rounded - str unrounded 0.318 0.1177 Inf 2.697 0.1234
## str rounded - stj unrounded nonEst NA NA NA NA
## str rounded - sh unrounded -0.140 0.1233 Inf -1.136 0.9489
## stj rounded - sh rounded -0.472 0.1025 Inf -4.608 0.0001
## stj rounded - sV unrounded 1.586 0.1028 Inf 15.420 <.0001
## stj rounded - str unrounded 0.578 0.0961 Inf 6.011 <.0001
## stj rounded - stj unrounded nonEst NA NA NA NA
## stj rounded - sh unrounded 0.120 0.1028 Inf 1.167 0.9412
## sh rounded - sV unrounded 2.058 0.1087 Inf 18.937 <.0001
## sh rounded - str unrounded 1.050 0.1023 Inf 10.259 <.0001
## sh rounded - stj unrounded nonEst NA NA NA NA
## sh rounded - sh unrounded 0.592 0.1087 Inf 5.450 <.0001
## sV unrounded - str unrounded -1.008 0.1027 Inf -9.816 <.0001
## sV unrounded - stj unrounded nonEst NA NA NA NA
## sV unrounded - sh unrounded -1.466 0.1090 Inf -13.446 <.0001
## str unrounded - stj unrounded nonEst NA NA NA NA
## str unrounded - sh unrounded -0.458 0.1027 Inf -4.457 0.0002
## stj unrounded - sh unrounded nonEst NA NA NA NA
##
## Degrees-of-freedom method: asymptotic
## P value adjustment: tukey method for comparing a family of 8 estimates
Turning now to the dynamic lip protrusion trajectories gives some hints as to why Thielking (2019) found a statistically significant difference between /sV/ and /str/. Figure 2.4 shows that lip protrusion in /str/ contexts increases in the latter part of the central portion of the fricative, at roughly 70% of its duration, while /sV/ remains rather stable. The contexts /stj/ and /str/ behave similarly in that both show a continuous increase in lip protrusion; overall, however, /stj/ displays more protrusion than /str/ and is closer to /∫/. /∫/ shows the largest amount of lip protrusion and reaches its maximum slightly after the midpoint of the fricative, at around 60%.
Figure 2.4: Dynamic lip protrusion across sibilant duration for various contexts
Running a GAMM analysis and plotting the results (Figure 2.5) confirms the patterns plotted in Figure 2.4. (However, I’m not sure how to fit the GAMM correctly, since the number of data points varies significantly between the different trajectories and k has to be set to at most the number of data points (i.e. the number of frames) minus 1.)
Figure 2.5: Results of Generalised Additive Mixed Effects model
As can be seen in Figure 2.6, there are striking differences in the trajectories of lip protrusion between sibilants followed by rounded and unrounded vowels. As expected, rounded contexts exhibit significantly more lip protrusion than unrounded contexts. /sV/ in particular shows major differences in terms of lip protrusion dynamics: while lip protrusion remains rather stable in unrounded contexts, the trajectory in rounded contexts resembles that of /str/ and /stj/, showing an increase in protrusion up to 75% of the sibilant before it starts to taper off.
Figure 2.6: Dynamic lip protrusion across sibilant duration by following vowel
Looking at each speaker individually reveals that the speakers differ in the dynamics of lip protrusion. Some speakers show nearly identical lip protrusion trajectories in /str/, /stj/ and /∫/ (M02), while others show a split between /∫, stj/ and /sV, str/ (F07, F08). There are also differences in protrusion dynamics: while some speakers show rather flat, i.e. unchanging, lip configurations, others show an increase in protrusion throughout the entire sibilant.
Figure 2.7: Dynamic lip protrusion across sibilant duration for various contexts and speakers
Figure 2.8: Static lip protrusion at sibilant midpoint for all contexts across all speakers
Figure 2.8 shows z-scored lip protrusion at sibilant midpoint across all contexts.
Figure 2.9: Dynamic lip protrusion across sibilant duration for all contexts across all speakers
Figure 2.9 shows lip protrusion in all contexts investigated. As expected, /∫r/ and /t∫/ pattern with /∫/ in that they show the highest degree of lip protrusion. The other sibilant contexts /sp/, /sk/ and /st/ pattern with pre-vocalic /s/. While /spr/ and /skr/ show slightly more lip protrusion than these, it is still lower than in /str/ and /stj/.
Taking into account the following vowel, however, reveals an interesting pattern (see Figure 2.10). In unrounded vowel contexts, /skr/ and /spr/ are shifted in the direction of /∫/ and show similar protrusion to /str/. In rounded contexts, this effect is smaller, as /skr/ and /spr/ pattern with their non-rhotic counterparts. Interestingly, /∫r/ shows a similar degree of protrusion in both contexts, while /∫V/ has less protrusion in the unrounded context. This is likely due to the influence of the rhotic on lip protrusion. Looking at tongue posture might provide some insight into the relationship between type of rhotic and lip protrusion: King and Ferragne (2020b) found that bunched /r/ shows more protrusion than retroflex/tip-up /r/, and previous studies on /s/-retraction suggest a similar relationship, with bunchers showing more retraction, possibly due to greater lip protrusion, i.e. lip protrusion might be more important than lingual configuration.
Figure 2.10: Dynamic lip protrusion across sibilant duration for all contexts and following vowel across all speakers
Midsagittal tongue ultrasound imaging was used to examine differences in tongue shape. In order to capture and quantify dynamic differences in tongue shape, Principal Components Analysis (PCA) and Linear Discriminant Analysis (LDA) were applied to the ultrasound images (see Smith et al. (2019), Faytak, Liu, and Sundara (2020) and Strycharczuk and Sebregts (2018) for a similar approach).
Processing of the ultrasound images followed Faytak, Liu, and Sundara (2020), using functions from the Python package SATKit (Faytak, Moisik, and Palo (2020)). First, a series of filtering operations was applied to all relevant sibilant ultrasound frames to reduce speckle noise in the ultrasound signal and improve the signal-to-noise ratio for the analysis (see Carignan (2014)). Figure 2.11 shows an unprocessed ultrasound frame and Figure 2.12 shows the same frame after filtering and resizing. The processed frames were then submitted to PCA for each speaker individually, taking the pixel values as input. The first 50 PCA scores were retained, which capture at least 80% of the variance in each speaker’s UTI frames. “For each speaker, the scores for these PCs and the target of each token, i.e., /sV/, /str/, /stj/, /∫/, were submitted to LDA. The resulting linear discriminant (LD) score, when normalized to a [-1,1] range for all speakers, may be taken as an index of how distinctly /sV/- or /∫/-like the sibilant in each token is: The LDA was structured such that /∫/ was consistently near -1, and /sV/ was consistently near 1, for all speakers.” (Faytak, Liu, and Sundara (2020)).
Figure 2.11: Unprocessed UTI frame by speaker M06.
Figure 2.12: Filtered UTI frame by speaker M06.
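The PCA-plus-LDA pipeline can be sketched in miniature as follows. This toy version uses synthetic "frames", only two classes (/sV/ vs. /∫/) instead of the four contexts, and a plain Fisher discriminant in NumPy; it is meant to illustrate the structure of the analysis, not to reproduce the per-speaker models of the actual study:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic 'ultrasound frames': 200 flattened 16x16 images per class,
# with the two classes differing in brightness over distinct pixel regions
n, d = 200, 256
frames = rng.normal(size=(2 * n, d))
frames[:n, :40] += 1.5          # pretend /sV/ frames brighten one region
frames[n:, 40:80] += 1.5        # pretend /sh/ frames brighten another
labels = np.array([1] * n + [0] * n)   # 1 = /sV/, 0 = /sh/

# PCA via SVD of the mean-centred pixel data; retain the first 50 components
X = frames - frames.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
scores = X @ Vt[:50].T

# Fisher LDA on the PC scores: w = Sw^-1 (m1 - m0)
m1 = scores[labels == 1].mean(axis=0)
m0 = scores[labels == 0].mean(axis=0)
Sw = np.cov(scores[labels == 1].T) + np.cov(scores[labels == 0].T)
w = np.linalg.solve(Sw, m1 - m0)

# LD score normalised to [-1, 1]; /sh/-like tokens fall low, /sV/-like high
ld = scores @ w
ld = 2 * (ld - ld.min()) / (ld.max() - ld.min()) - 1
print(ld[labels == 1].mean() > ld[labels == 0].mean())  # True
```

In the actual analysis the LDA has four context labels and is fitted per speaker, but the logic is the same: PCA compresses the pixel data, and the discriminant axis through the PC space yields a single /sV/-to-/∫/ index per frame.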
Figure 2.13 shows the results of the LDA for the four contexts /sV/, /str/, /stj/ and /∫/. As can be seen, across all speakers, /str/ and /stj/ have distinct lingual configurations from both /sV/ and /∫/. /stj/ is however closer to /∫/ than /str/, reflecting the higher /∫/-likeness of /stj/ in lip rounding and acoustics. Both /str/ and /stj/ become more /∫/-like towards the end of the sibilant.
Figure 2.13: LD score trajectories across sibilant duration
Figure 2.14: LD score trajectories across sibilant duration by following vowel
Figure 2.15: LDA score trajectories across sibilant duration by speaker
Figure 2.16: LD score of central portion of sibilants across all speakers
Figure 2.17: LD score of central portion of sibilants for all speakers
In a recent paper, Wrench and Balch-Tomes (2022) leverage the DeepLabCut suite to automatically track the tongue in ultrasound and the lips in frontal-view lip videos. DeepLabCut (DLC) uses deep learning and transfer learning to “perform markerless estimation of speech articulator keypoints using only a few hundred hand-labelled images as training input”. Their models estimate keypoints with a performance similar to that of human annotators (see Wrench and Balch-Tomes (2022) for more details). Using this fully automatic method has great potential for the ultrasound data of the current study. As described in Turton (2017), applying PCA and LDA to raw image data can make it difficult to interpret the PCA scores in terms of a mapping between tongue configuration and PC score, because of the messiness of the pixel data. Applying PCA and LDA to tongue splines might therefore provide a better mapping from PC score to tongue configuration, due to significantly less messiness in the spline data (see Turton (2017)).
Running DLC on the current ultrasound image data yields promising results. Preliminary plotting of the automatically tracked tongue splines shows only minor problems for some speakers and/or tokens. Speakers F03, F08, F09 and M04 were excluded from the analysis due to insufficient quality of their ultrasound imaging; as can be seen, the model also had difficulty tracking the tongue in these speakers (see Figure 2.18). Overall, the results of the algorithm appear reasonable. Compared to the manually annotated tongue spline data in Thielking (2019), there seem to be only minor differences. In all speakers, /∫/ shows more bunching than /sV/. The relationships between the four contexts are, for all speakers, comparable to the GAMM-smoothed splines reported in Thielking (2019).
PCA and LDA analysis of these data can therefore provide valuable insight into the relationship between lingual configuration and /s/-retraction, and help tease apart the influence of the lips and the tongue on this phenomenon. Such an analysis can also shed light on the usefulness of PCA on image data compared to automatically/manually annotated spline data, and on the interpretability of such analyses in terms of a mapping between articulation and its quantitative measures.
Figure 2.18: DLC-tracked splines
Figure 2.19: DLC-tracked mean splines for each context